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Abstract 

Predictive recursion is an accurate and computationally efficient algorithm for 
nonparametric estimation of mixing densities in mixture models. In semiparamet- 
ric mixture models, however, the algorithm fails to account for any uncertainty in 
the additional unknown structural parameter. As an alternative to existing profile 
likelihood methods, we treat predictive recursion as a filter approximation to fitting 
a fully Bayes model, whereby an approximate marginal likelihood of the structural 
parameter emerges and can be used for inference. We call this the predictive re- 
cursion marginal likelihood. Convergence properties of predictive recursion under 
model mis-specification also lead to an attractive construction of this new proce- 
dure. We show pointwise convergence of a normalized version of this marginal 
likelihood function. Simulations compare the performance of this new marginal 
likelihood approach that of existing profile likelihood methods as well as Dirich- 
let process mixtures in density estimation. Mixed-effects models and an empirical 
Bayes multiple testing application in time series analysis are also considered. 

Keywords and phrases. Density estimation; Dirichlet process mixture; empirical 
Bayes; filtering algorithm; marginal likelihood; martingale; mixed effects model; 
multiple testing; profile likelihood. 



1 Introduction 

Consider data Yi, . . . ,Yn modeled as independent draws from a common, nonparametric 
mixture distribution with density 



^f{y) = p{y\ u)f{u) dfi{u), 2/ e ^, 
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where {y, u) ^ p{y | n) is a known kernel on x and / is an unknown mixing density 



i n F, t he set of densities with respect to a a-finite Borel measure n on . iNewton et al 



fll998l ) introduced the following stochastic algorithm, called predictive recursion, to esti- 
mate / and rrif. 

PR Algorithm. Choose an initial estimate /o G F of /, and a sequence of weights 
Wi, . . . ,Wn G (0, 1). For i = 1, . . . ,n, compute the following: 



= (1 - Wi)fi^i{u) + w. 



p(Yi I u)fi_i{u) 



JpiY,\u')f,.iXu')dfiiu'y 
P{y \ u)fi{u)dn{u), ye^. 



(2) 
(3) 



Return /„ and m„ as the estimates of / and m/, respectively. 



The algorithm's strengths include its fast computation and its unique flexibility to 
estimate the mixing density with respect to any user-specified dominating measure /i. 
Predictive rec u rsion a lso has a close connection to Dirichlet process mix t ure ra o dels; see 
Newton et aD f ll998h . lOuintana and Newtonl f l200oh . iNewton and Zhand f ll999h . [Newton 
( I2OO2I ). and Section 121 iTokdar et al.l ( 120091 ) show that when Yi, . . . ,Yn are generated inde- 
pendently from a density m which equals ruf* for some /* G F, the resulting estimates fn 
and rUn converge as n — )■ 00 , respe ctivel y and in appropriate topo lo gies, to /* and nif*; 
see al so iGhosh and Tokdarl (120061 ) and iMartin and Ghosh] (120081 ) . iMartin and Tokdar 
( I2OO9I ) show that if m does not equal rUf for any / G F, then the estimates still converge, 
but now the limits are characterized by the minimizer /* of the Kullback-Leibler diver- 
gence K{m,mf) = J log{m{y)/mf{y)}m{y) dy. The minimizer exists and is unique under 
certain conditions. An upper bound on the rate of this convergence is also available. 

In statistical applications, however, an exact description of the kernel piy \ u) is rarely 
available. It is more common to use a family of kernels p{y \ 6, u) indexed by a parameter 
6* G O and model Yi, . . . , as independent draws from a semiparametric mixture 



(4) 



"^fAv) = P{y\ 0,u)f{u)d^i{u) 



where both 9 and / are unknown. A frequently encountered example of this is density 
estimation with mixtures of Gaussian kernels, where p{y \ 6,u) = N{y \ u, 9"^) with 9 
playing the role of a bandwidth. A related formulation is in the linear mixed effects 
model Yi = Ui + X[I3 + cr£j, where 9 = {(3, a) is unknown, and the density / of the 
random effect Ui is not restricted to a parametric family. While 9 is more like a nuisance 
parameter in the density estimation problem, it takes center stage in the mixed effect 
mo del. In eit h er cas e, predictive recursion fails to provide any statistical analysis for 9. 



Tao et al.l (119991 ) counter this shortcoming by embedding predictive recursion in a 



profile likelihood framework. At any given 9, one runs the predictive recursion algorithm 
with kernel p{y \ 9, u) and a suitable initial guess /o,e to recursively compute fi^e and mj ^ 
for i = 1, . . . ,n. The final update m„e is then plugged in to give the following profile 
likelihood in 9: 



Ll{9) = llm^,e{Y^. 



(5) 
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Tao et al.l ( 119991 ) maximize this profile likelihood to estimate 9. Such a plug-in approach 



does not account for the lack of precision in estimating the mixing density /. In the 
density estimation setting, the profile likelihood may be maximized at the zero bandwidth, 
completely ignoring the extreme variability of the estimates of / at small bandwidths. 
Such undesirable behavior can be avoided by imposing a penalty on the estimate of /. 
But a general framework along these lines is yet to emerge, particularly for problems 
where inference on ^ is the main focus. 

In this paper we demonstrate that predictive recursion's close connection with the 
Bayesian paradigm offers a rich alternative to the plug-in approach. By viewing it as an 
approximation to fitting a fully Bayesian model on {0,f), it is natural to ask whether 
it can also provide an approximation to the marginal likelihood for 6 as defined by the 
Bayesian model. In Section [2] we show that such an approximation is indeed available 
and of the form 

n 

Lm = U'^^~l Ay^- (6) 

i=l 

The approximate marginal likelihood Ln{6), which we call the predicti ve recursion margina l 
likelihood, appears to inherit the intrinsic Ockham's razor properties (jjefferys and Berger 



19921 ) of the original Bayesian formulation. That is, the 6 values for which the conditional 



prior on / is more spread out automatically receive greater penalty. 

In Section [3] we show that ii Yi, . . . ,Yn are independent samples from a density m, 
then logL^'(6') equals —nmif^]^K{m,mf^0) plus a quantity that grows slower than n. 
A consequence of this is a convergence property of the maximum predictive recursion 
marginal likelihood estimate 

^„ = argmaxL^(^) (7) 

6»g0 

that follows from an argument similar to that of IWaldl (119491 ). Specifically, if 6 is finite, 
then 6n, converges to 6** as — )■ oo, where the limit is characterized by the minimizer 
{6*,f*) of K{m,mf g) over O x F. Our simulation studies suggest that similar results 
should hold for compact as well, but so far a proof has eluded us. 

An exact sampling distribution for 6n in ([7]) is not available. Therefore, for infer- 
ence on 6 we estimate the standard error of 6n via the curvature of at its maxima. 
This is motivated by the interpretation of the predictive recursion marginal likelihood as 
an approximate Bayesian m arginal likelihood for which Laplace approximation applies 
(ITierney and Kadandll986l ). 

Several examples are presented in Section |H For density estimation, our simulations 
indicate that closely approximates Bayes Dirichlet process mixture marginal likeli- 
hood, whereas is more sporadic, in some cases concentrating on the boundary of 
the parameter space. Applications to interval estimation in random-intercept regression 
models and multiple testing in mixtures of autoregressive process models are also given. 



Approximation to the Dirichlet process mixture 
marginal likelihood 



As noted in lNewton et al.l ( 1l998l ) and lNewtonI (120021 ) . the updating scheme ([2]) has a close 
connection with the posterior updates in a Bayesian formulation when / is modeled by 
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a Dirichlet process prior. In this section we further explore this connection to estabhsh 
Ln{6) as an approximation to the marginal likelihood of 6 as defined by such a Bayesian 
formulation. 

To be precise, consider the following extension of the mixture model (jl]): 



m{y) = mpfiiy) = J p{y \ 0,u) dF{u), (8) 

where F is an unknown probability measure on not necessarily dominated by /i. 
Consider a Bayesian formulation 



F\e ^Uo 



e 



(9) 



where lie is, for each 6* G O, a probability distribution over the space of probability 
measures F, and F is a probability distribution on G. The posterior distribution Tn of 9 
given the n observations can be written as dTn{0) oc L^{9) dr{6), where 



1=1 



is the marginal likelihood of 6 obtained by integrating out F from ([9]). For every 6 E Q, let 
Ui^g denote the conditional posterior distribution of F given 6 and the first i observations, 
i.e., dIli^g{F) oc {Ilfci dIlg{F). Then by linearity of mpg and Fubini's theorem. 



n 



Ll{e) = J] / mF,e{Y,)dY[,.i,e{F) = J] / p{Y, \ O.u) dF{u)dY[,^^,g{F) 
1=1 1=1 

n „ 

= n / P{y^\0,u)dF,^^,e{u), (10) 
i=i 

where Fi^ = J F dIli^g{F) is the conditional posterior mean of F given {Yi, . . . ,Yi, 9). 

Now consider the special case where 11^ = DP(ao, -Fq e), the Dirichlet process distribu- 

tion with precision parameter ag > and base measure Fq^ (jFergusonlll973l : iGhosh and Ramamoorthi 
20031 ). Assume that the base measures Fq£ are all absolutely continuous with respect to 
/i, admitting densities fn,^ = dFo^g/du £ F. It follows from the Polya urn representation 
of a Dirichlet process (IBlackwell and MacQueerull973l ) that 



dF^,e{u) 



a 



a + 1 



dFQfi{u) 



1 



p{Y^ I e, u) dFo,e{u) 



a 



1 JpiY, I e,u')dFo,eiu') 



Therefore, Fi^e is absolutely continuous with respect to /i and the density dFi^g/dfi G F 
is identical to the predictive recursion output /i ^ based on the single observation Yi, 
with initial guess /o,6», kernel p{y \ 9,u) and weight wi = 1/(1 + ao). Consequently 
L2{6) = L2{0) as can be verified by comparing ([HD and flTUl) . 

This analogy, however, does not carry over to Lf{0) and Lf{9) for z > 3. For z = 3, 
the relevant conditional posterior mean F2fi does not admit a representation as in (ITT]) 
in terms of Fi^g and p{Y2 \ 6, u) because the conditional posterior distribution Hi e is 
no longer a Dir ichlet process distribution, but rather a mixture of Dirichlet processes 
flAntoniaklllQTi ). 
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To remedy this, consider an approximation to the Bayesian model, where we succes- 
sively replace Ui^e with Ui^g = DP(aj, Fi^) where Fq^q = Fq^q and 

^ ^ / Fmj.,g(y,)dn,.i,e(F) 

is what one would obtain for E{F \ Yi, - ■ ■ ,Yi,6) if the conditional posterior of F given 
(Fi, . . . ,Yi_i,9) was indeed Hi-i^. These successive replacements can be thought of as 
a dynamic, mean preserving, filter approximation to the original Bayesian model. Note 
that every Fi^e remains absolutely continuous with respect to fi and satisfies the recursion 

( \ ^9 / 1 p{Yi\9,u)dFi^i^e{u) 



1 + tti-i ' 1 + "i-i / p(Yi I 6, u') dFi^i^eiW) 

This, coupled with the initial condition Fo^e = -^o,6», implies that the densities dFi^e/dfj, 
are precisely the fi^ that result from predictive recursion applied to the observations 
Yi,...,F„, with initial guess /o,6i, kernel | 6,u) and weights Wi = 1/(1 + 
Therefore the corresponding approximation L^{9) = YYi=i I P(Xi I ^5 ^) dFi^i^g{u) of L^{0) 
is exactly L^^(^). 

For every 6' G 6, the quantity HiLi '^i-i,6'(^i) indeed defines a joint probability density 
for [Yi, . . . , Yn) which admits L^n{0) as an exact likelihood function for 6. However it is 
unknown whether this joint density corresponds to any exchangeable hierarchical model 
on the Yi's, thus making it somewhat unsuitable to use it for statistical analysis. For this 
reason, we do not focus on studying L^X^) from the perspective of the joint model for 
which it is an exact likelihood function. We are instead interested in studying it as an 
inferential tool when y^'s are generated independently from a common density m which 
may or may not be a mixture as in (jlj). 

3 Asymptotic theory 

3.1 Notation and preliminaries 

For F, the set of all densities on ^ with respect to fi as in Section [H let F be its closure 
with respect to the weak topology. With a slight abuse of notation, the elements of F 
are also denoted by /, although th ey need not admit a densi ty with respect to For 



each 6, let M.g = {mf^g : / G F}. iMartin and Tokdarl (120091 ) show that the predictive 



recursion estimates mn,e converge, for each fixed 6, to the best mixture density in Me, if 
the following assumptions hold. 

Assumption 1. Observations Yi, ^2, • • • are independent with a common density m, and 
K{m, m') is finite for all m' G M = |Jgg@ Mg. 

Assumption 2. The weight sequence satisfies ^„Wn = oc and J2n'^n < 
Assumption 3. The set F is compact with respect to the weak topology. 
Assumption 4. The mapping u 1— )■ p{y \ 9, u) is bounded and continuous for all (y, 6). 
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Assumption 5. For each {61,62) pair, tliere exists A = ^(6*1, 6*2) < 00 such that 



sup 

ni,U2 



p{y\6i,ui) 

Piy\62,U2) 



m{y) dy < A. 



Assumption [2] is standard in the literature o n stochastic approximati on algorithms, 
of which predictive recursion is a special case ( iMartin and Ghosh! l2008l ). and it holds 
if Wn decays like n~'^ for 7 G (1/2,1]. Assumption [3] is satisfied if, for example, ^ 
is compact and /j, is Lebesgue measure. The more demanding Assumption [5], holds for 
many standard kernels p{y \ 6,u), such as those arising from Gaussian or other exponential 
family distributions, whenever is compact and m admits a moment-generating function 
on 

Define the mapping 



IC{6) = M{K{m, mj^e) ■ f e¥}, 6ee, 



(12) 



the smallest KuUback-Leibler diverge nce over M^. Attainment of the infimum in f[T2]) 
follows from Assumptions [3] and IH see Martin and Tokdarl ( 2009 ). Lemma 3.1. Let a„ = 
'^^=i'Wi denote the partial sums of the weight sequence {wn : n > 1}. For two real 
sequences a nd |/3nj, we write ctn = O (.0„) if Cin/ Pn is bounded, and a„ = o(/3„) if 
Cin/ l^n 0. Then iMartin and Tokdarl ( l2009l ) prove a version of the following theorem. 



Theorem 1. Under Assumptions\^^ K{m, rrin^g) — )■ K*{6) almost surely as n ^ 00 for 
each fixed 6. In addition to Assumptions U\{^ if the infimum in 0121] is attained in the 
interior of¥, and if J^n^nW^ < 00, then K{m,mn^e) — K*{6) = o{a~^) almost surely. 

For weights that satisfy Wn = 0{n~"'), the extra condition, '^^cinW^. < 00, in the 
second part of the theorem holds if and only i f 7 € (2/3, 1]. Therefore, th e best available 
rate in this case is o(n~^/^) almost surely. But Martin and Tokdarl (2009) argue that this 
rate is conservative. 



3.2 Main results 

Write £n = log and define the following normalization: 

Kn{6) = - > log — = , 13 

where £on = ^"=1 log"^(^) is the log joint density of Fi, . . . , Y^. We will show that the 
main result of Theorem [1] holds with Kn{6) in place of K{m,mn,e)- 

Assumption 6. For each 6, there exists B = Bg < 00, such that the density mf^e G 
at which the infimum in (IT2l) is attained satisfies 



log \ m{y) dy < B. 



If the mixture model is correctly specified, i.e., m G M, then Assumption |6] is a conse- 
quence of Assumption [5] and Jensen's inequality; see the argument for (120|) in Appendix 1. 
But this assumption seems reasonable even if m ^ M, as it assures that the proposed 
mixture model @ is, in a certain sense, close enough to the truth. 
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Theorem 2. Under Assumptions\^;\^ Kn{0) — ?■ K*{6) almost surely as n ^ oo for each 
fixed 6. In addition to AssumptionsU^{^ if the infimum in f|T2|) is attained in the interior 
of¥, and if Yln^nw"^ < oo, then Kn{0) — K*{6) = 0{a~^) almost surely. 

Proof. See Appendix 1. □ 

From Theorem [2] we can conclude that ^n(^) = —nK*{6) + £on + o(n). Therefore for 
any 6^1, 6*2 with K*{9i) ^ K*\Q'^, the difference — ^(^2) grows linearly in n. Such 

linear growth in log likelihood differences is the building block of Wald's famous proof of 
consistency of maximum likelihood estimators. In our case, we have the following. 

Theorem 3. Suppose 6 is a finite set and Assumptions 1-6 hold. Let G* = {9' : K*{9') = 
info Then 6n G 0* almost surely for all sufficiently large n. In particular, if 

G* = {9*}, then On — )■ 0* almost surely. 

Proof Let 0t = 6 \ 0^ By Theorem [2] and the finiteness of 9, supgt 4,(^) - C(f ) = 
— njinfet K*{9) — info K*{6)} + o{n) —00, almost surely. On the other hand, £(6'„) — 
^n{0*) > by definition of On and thus On ^ 0^ almost surely for all large n. □ 

While Theorem [3] is limited to finite 9, our empirical results Section suggest that 
the convergence can be extended to a compact 6. Toward this, we must show that Kn 
is either uniformly equi- continuous or convex, a task to be taken up in a future work. 

It is unclear whether results similar to Theorems |2] and [3] hold for the profile likelihood 
Un- Our proofs of these two theorems are based on martingale theory and rely on the 
fact that the summand \ogmi_i Q{Yi) in L^ni^) measurable with respect to the a-algebra 
generated by li, . . . , Vj. This does not hold for the summands logm„ in logL^(6'). 



3.3 Numerical illustration 

Suppose Yi, . . . ,Yn are modeled as independent observations from the mixture density 
'>T^f,a{y) = /o vhj I (J,u)f{u)du, where p{y \ a,u) is a N{u,a'^) kernel. The true model, 
however, is Yi = 0.5 + O.lZj, where Zi, . . . , Z„ is a random sample from the Student-t 
distribution with 5 degrees of freedom. The true density m{y) underlying the shifted and 
scaled Student-t model cannot equal mf^„{y) for any / with support [0,1] and cr > 0, 
because the tails of mf ,j{y) decay exponentially while those of decay polynomially. 
But despite this model mis-specification, Kn{(T) — )■ K*{cr) pointwise in a by Theorem [2J 
We approximate K{m,mf^cr) with its Gaussian quadrature form 

^ miyr) 



[ m{y) log ^^^\ dy ^ hrm{y,) log 



which is then numerically optimized over f{ui), . . . , f{uj) to obtain an approximation to 
K*{a). Here {{aj,Uj) : j = 1, . . . , J} are the Legendre node- weight pairs of order J = 101 
for the interval [0, 1], and {{K, yr) '■ r = 1, . . . , R} are the same for [—0.5, 1.5] with R = 
101. Figure [U shows Kn{o') and the limit K*{cr) for three choices of n, each replicated with 
100 independent data sets from the shifted and scaled Student-t distribution. In addition 
to showing pointwise convergence. Figure [1] also gives an indication of the conjectured 
convexity of Kn{cr). 
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n = 1000 n = 10000 n = 1e+05 




Figure 1: Gray lines show Kn{<^) for 100 independent data sets from the mis-specified 
mixture model in Section 13.31 Black lines show an approximation of the pointwise limit 

4 Examples 

4.1 Density estimation 

Consider density estimation where p{y \ a,u) = N{y \ u,a'^) is the Gaussian kernel with 
bandwidth a playing the role of 6 and the mixing density / assumed to have support 
[0, 1]. For the Bayes approach, the distribution function F is modeled as a draw from a 
Dirichlet process distribution with unit precision parameter and a uniform base measure 
on [0, 1]. For predictive recursion, we take /o to be a uniform density on [0, 1], matching 
the Dirichlet process base measure, and set Wi = (i + 1)"^/'^. Figure [2] shows i^^ (cr), 
and ^^^(cr) for 12 data sets of size n = 50 simulated from the Gaussian mixture 
model mf^cr{y) = J p{y \ a,u)f{u)du for several (o", /) pairs. The four different mixing 
distributions were chosen to capture various shapes and characteristics, incl uding discrete. 



continuous, and a mixture of each. The i mportance samplin g techn ique of iTokdar et al. 



(120091 ) is used to evaluate L^{(t); see also iMacEachern et al.l (119991 ). 



Figure [2] shows the normalized marginal likelihoods for the three different methods. 
It is clear that L^icr) closely approximates L^{cr) in all cases, while L^{(t) deviates ar- 
bitrarily, sometimes peaking at a = 0. The approximation of L^icr) by L^{cr) is robust 
against the smoothness and skewness properties of the underlying mixing distribution 
/. This is quite striking because the Dirichlet process formulation views / as a discrete 
distribution while predictive recursion is designed to recover smooth densities. We also 
note that both the approximate marginal and profile likelihoo d calculations a re ord ers of 



magnitude faster than those for Dirichlet process mixture; see iTokdar et al.l (12009! ) for a 
comparison of run times. 

4.2 Random- intercept regression models 



In this section we study two regression models along the lines of iTao et al.l (119991 ). In 
each case, we consider data on a response Y and a vector of predictors X G M"^ for n 
subjects each with r replicates. In our first study, y is a continuous variable and is linked 
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0.00 0.10 0.20 
a 



0.0 0.1 0.2 0.3 0.4 

CT 



0.0 0.1 0.2 0.3 0.4 

CT 




Figure 2: Solid lines show L^{cr), dashed lines show L^{cr), and dotted lines show L^^icr) 
for the density estimate example in Section 14.11 These curves have been normalized to 
integrate to 1. The true value of a is marked by a vertical gray line. The figure's four 
rows correspond to four different mixing distributions. The first is a Beta(2, 6) density; 
the second is a Beta(10, 30) density; the third is an equal mixture of point masses at 1/4 
and 3/4; the fourth is an equal mixture of Beta(2, 6) and a point mass at 3/4. 



9 



to the predictors through the random-intercept hnear regression model: 



Fi,- |X,,-~iV(f/, + X:/,a=^), (14) 

where i = l,...,n indexes subjects, and j = l,...,r indexes rephcates. We assume 
that the Yi/s are conditionally independent across both i and j. The subject-specific 
intercepts, the f/j's, are taken to be independent draws from a probability density / 
with respect to the Lebesgue measure on an interval ^ = [a,b]. Write 9 = {f3,a'^) as 
the unknown parameter of in terest. A related semiparametric Bayes model appears in 



Bush and MacEachernI (119961 ). 

To cast this regression model as the mixture model (j4]), we assume Xi = {Xn, . . . , Xir) 
are stochastic, sampled independently from a density g over x ■ ■ ■ x M"^. Then (fT4|) 
says that {Xi, Yj), where = (Yn, ■ ■ ■ , Yir), are independent observations from a density 
m = rrif^g as in (jl]) with a kernel 

y \ e,u) = p^{y \ 6, u)g{x), (15) 
where the conditional density of Y given x is 

1 1 ^ 

j=l 

With this setup, 6 can be estimated by maximizing the predictive recursion marginal or 
profile likelihood. The predictor density g need not be estimated: it drops out from 
the updating equation ([2j), and so does not affect fi^g. Consequently, mi^g{x,y) = 
^i,e,x{y)g{x), where nii^g^xiy) = f Px{y\d,u)fi^g{u) du does not involve g, and hence 
logL^'(6') incorporates g only through an additive constant. Also note that by factor- 
ing m{x,y) = mx{y)g{x), one can write K{m,mf^g) = j K{mx,mf^g x)g{x) dx where 
iTT'f,e,x{y) = / Px{y\d,u)f{u)du. Thus the {9*,f*) that characterizes the limiting asymp- 
totic properties of L^{0) minimizes an average KuUback-Leibler divergence Klm^jmf^g x) 
of the conditional densities weighted by g{x). 

Evaluation of L^{0) requires a single pass through the predictive recursion algorithm 
for each 6, which is computationally inexpensive, even for large n. Optimization, how- 
ever, requires several evaluations of LJ^' and its gradient. With computational efficiency in 
mind, we present a version of predictive recursion that also produces VL^ as a by-product, 
with no substantial increase in computational cost; see Appendix 2. This gradient algo- 
rithm, coupled with any packaged optimization routine, makes for fast semiparametric 
estimation of 6. 

For inference on 6, the exact sampling distribution of ^ in ([7]) will not be available 
in general, so some sort of approximation is needed. Here we propose a curvature based 
approximation, where the covariance matrix of 6 is estimated by the inverse Hessian 
H = {— V^£n(^)}~^, readily obtained from the output of the optimization routine. A 
stipulated 100(1 — a)% confidence interval for 9j can be obtained by taking 9j ± Za/2hjj, 
where Zg denotes the upper q^^ quantile of the standard normal distribution, and hjj is 
the j^^ diagonal element of H. 

Table [Ureports the performance of the predictive recursion-based marginal and profile 
likelihood estimates averaged over 500 datasets generated from model ( 1T4|) with n = 50 
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or 500, r = 4, d = 2, (5 = (2,5)', and cr = 2. The covariates Xjji's are independent 
iV(0, 1) and Xij2 = Ji + 0.1%, with J^'s satisfying P(Ji = 0) = P(Ji = 1) = 0.5 and 
Zjj's independent A^(0, 1). In this case, Xiji is a within-subject covariate while Xij2, 
which is roughly constant in j, acts like a subject-specific covariate. Three choices of 
/ are considered: iV(0, 2^), a shifted exponential distribution with rate 0.5 and support 
(—2, oo), and a discrete u n iform supported on ±2. Each choice of / has mean zero and 



variance 4. See iTao et al.l ( 1l999l ) for more details on these choices. For comparison to a 
parametric model fit, we also include a likelihood-based method that assumes the mixing 
distribution is Gaussian with unknown parameters. 

Performance of each estimate is measured by the root mean-square error in estimating 
each component of 9. We also include average coverage of a stipulated 95% confidence 
interval for each estimating method constructed as described above. Predictive recursion 
IS run with Wi = {l + ^ = Y±3Sy where Y and Sy are the mean and standard 

deviation of the full data {Yy}, and /o the uniform density over In this example, 
the predictive recursion-based marginal and profile likelihood methods perform similarly. 
Both are competitive with the parametric Gaussian method when / is indeed Gaussian, 
and are better for non-Gaussian /. The relative similarity between the marginal and the 
profile likelihood methods, unlike what we observed in the density estimation example 
before, can be explained by noting that for model ([HD, the data is informative about 
the parameter a due to availability of replicates. We note that although the coverage 
probabilities are generally close to the stipulated 95% level, there are some noticeable 
differences. First, the marginal and profile likelihood coverage probabilities for P2 are off 
the mark in the small n case. That Xij2 is partially confounded with the group structure 
is one potential explanation for this phenomenon. Second, in estimating a for the discrete 
uniform model, which lies in the boundary of F, the coverage falls dangerously low, even 
for large n. For such boundary cases, bootstrap confidence intervals might be more 
appropriate; see Section [51 

In our second example, we retain much of the above setting but consider a binary Y 
for which model (fT4l) is adapted to the a random-intercept logistic regression model: 

iogit{p(y,, = i)} = [/, + x:/, (16) 

where t/j's are independent draws from a uniform density / on ^ = [—8, 8] and 6 = 
Wi, P2) is unknown. This model corresponds to (jl]) through (fTSl) with an appropriate 
choice of Px{y\d, u). The last four columns of Table [1] reports the performance of maximum 
likelihood estimates of 9 based on the predictive recursion marginal and profile likelihood 
and the parametric Gaussian set, with same choices for /32, n, r, /, Wj, and {Xij} as 
in our first study. For n = 500, all methods perform similarly; for n = 50 the proposed 
marginal likelihood approach is better in terms of both estimation accuracy and coverage. 

The marginal and profile likelihood methods used above inherit the order- dependence 
characteristic of predictive recursion. That is, L^{0) and L^{0) depend on the order in 
which the data enter the recursion ([2]). The order is not important asymptotically, but 
finite sample behaviors may be sensitive to the specific order used. A simple remedy 
to this is to apply predictive recursion on M random permutations of the data and 
average the M resulting marginal and profile likelihood functions prior to optimization. 
Permutation-averaging in the two regression examples gave only marginally better results 
compared to what is shown in Table [H 
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Study I 








Study II 




/ 


Method 




RMSE 


C 


overaj 




RMSE 


Coverage 










a 


/3i 




a 


Pi 




Pi 


P2 




n = 5( 


] 




n = 


50 






Gaussian 


0.16 


0.60 


0.12 


95 


95 


94 


1.02 


2.48 


88 


86 


Gaussian 


Marginal 


0.16 


0.68 


0.12 


95 


86 


94 


0.61 


1.58 


95 


91 




Profile 


0.16 


0.71 


0.12 


94 


89 


92 


0.65 


1.78 


95 


90 




Gaussian 


0.17 


0.61 


0.11 


93 


94 


95 


0.86 


2.13 


90 


91 


Exponential 


Marginal 


0.17 


0.52 


0.11 


94 


91 


95 


0.64 


1.66 


96 


96 




Profile 


0.17 


0.50 


0.12 


94 


92 


94 


0.67 


1.76 


96 


96 




Gaussian 


0.15 


0.58 


0.11 


96 


95 


95 


0.94 


2.42 


87 


88 


Uniform 


Marginal 


0.14 


0.36 


0.11 


96 


97 


93 


0.82 


1.92 


92 


92 




Profile 


0.14 


0.34 


0.12 


96 


97 


90 


0.89 


2.14 


92 


93 




n = 500 




n = 


500 






Gaussian 


0.05 


0.18 


0.04 


96 


95 


95 


0.17 


0.43 


88 


85 


Gaussian 


Marginal 


0.05 


0.19 


0.04 


96 


94 


95 


0.15 


0.39 


93 


93 




Profile 


0.05 


0.19 


0.04 


96 


94 


95 


0.16 


0.41 


93 


95 




Gaussian 


0.05 


0.18 


0.04 


93 


96 


95 


0.15 


0.41 


90 


87 


Exponential 


Marginal 


0.05 


0.15 


0.04 


95 


94 


95 


0.15 


0.32 


96 


96 




Profile 


0.05 


0.14 


0.04 


95 


95 


94 


0.15 


0.34 


95 


95 




Gaussian 


0.05 


0.19 


0.04 


96 


94 


94 


0.18 


0.50 


88 


84 


Uniform 


Marginal 


0.05 


0.11 


0.05 


94 


95 


80 


0.20 


0.52 


91 


89 




Profile 


0.05 


0.11 


0.05 


95 


94 


84 


0.21 


0.54 


90 


86 



Table 1: Comparison of the predictive recursion-based marginal and profile likelihood 
methods along with a parametric Gaussian model for parameter estimation in the 
random-intercept regression models in Section 14.21 RMSE is root mean-square error. 
Coverage probabilities are multiplied by 100. Studies I and II denote the linear and 
logistic models, respectively. 
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4.3 Multiple testing in time series 

Let Y = {Y{t) : t = 1, 2, . . . , T} denote a discrete-time stochastic process under a first- 
order autoregressive model; i.e., 1^(0) is normal with mean and variance o"^(l — (p)^^, 
and 

Y{t)=^ + (l)Y{t-l) + ae{t), ~ iV(0, 1), t = l,2,...,T, 

for G M, (f) ^ ("Ijl) ^iid cr^ > 0. In other words, Y has a T-dimensional normal 
distribution with mean ^1^ = (^, . . . ,0' ^ ^'^'^ T x T covariance matrix of the 
form 

2 

= ^ = e C (0, oo) X (-1, 1). 

For a sample Yi, . . . ,Yn of similar processes, consider a mixture of simple first-order 
autoregressive models, namely, 

Y,\{^„U,)r^N{^dT,^uJ, Uir^fiu), ^i\9^9{0) + {l-9)N{0,l), (17) 

where independence is assumed throughout, and (0) denotes a degenerate distribution at 
0. It shall be further assumed that 6 is close to 1, so that Yi, . . . ,Yn are sparse in the 
sense that most have mean zero. The goal is to identify those processes with non-zero 
mean trajectories. 



Model f|T7|) is similar to that which appears in IScottI (j2009[ ). He considers a financial 
tii ne-ser i es app lication in which Yi{t) is related to the i^^ firm's return on assets for year 



t. IScottI f l2009l ) takes the Yi(t) to be the standardized residuals obtained when return on 
assets is regressed on several important factors. Therefore, firms whose F-process has zero 
mean are ordinary, those with non-zero me an are somehow extraordinary. The goal is to 



flag those extraordinary firms. IScottI (120091 ) considers a nonparametric Bayes model where 
/ is sampled from a Dirichlet process distribution, and the signal trajectories are constant 
zero with probability 6 or sampled from a Gaussian process with probability 1 — 6. Model 
f ll7p is a special case in which the signal trajectories are restricted to constant functions. 
But the assumption of constant paths is not critical. Indeed, the marginal likelihood 
procedure described below can be extended to handle signal trajectories with a relatively 
low-dimensional parametrization. 

From model f lT7|) . it is clear that Yi, . . . ,Yn are independent with common mixture 
density 

^fAy)= / / N{y\a,j:^){e{o){0 + {i-e)N{ao^)}dU{u)du 

J"i' J -oo 

= [ {^iV(y|0,S„) + (l-^)iV(l/|0,S„ + lTl^)}/( 



u) du. in 



Here, with the particular choice of parametric mixing distribution, ^ can be integrated 
out analytically, leaving only a mixture over u. Predictive recursion marginal likelihood 
is used to estimate 9 and / and, in turn, these estimates are used to classify the sample 
paths Yi, . . . 

If 9 and / were known, the Bayes oracle rule for classifying Yi as null or non-null is 
as follows: for 0-1 loss, conclude Yi is non-null if 

^ 9jN{Y,\Q,T.^)f{u)du ^ ^ ^ 
mf^0{Yi) ~ 
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Method Mean SD FDR MP 



Dense, 9 = 0.75 Plug-in 0.750 0.008 0.085 0.109 

Oracle — — 0.083 0.108 

Sparse, 6 = 0.95 Plug-in 0.949 0.004 0.082 0.026 

Oracle — — 0.073 0.026 



Table 2: Summary of inference in the autoregressive process mixture model simulations 
for the plug-in empirical Bayes and Bayes oracle methods. Mean denotes the simulation 
mean of 9 and SD is its standard deviation; FDR is the observed false discovery rate of 
the test and MP is its misclassification probability. 



Since 9 and / are unknown, we mimic the Bayes oracle classifier with the plug-in estimate 



9,. 



m 



(19) 



which is an estimate of the local false discovery rate (lEfronl l2004l . l2008l : ISun and Cai 
20071 ). and can be viewed as an empirical Bayes approximation of the posterior inclusion 
probabilities in IScottI (120091 ). there obtained by Markov chain Monte Carlo. 

For illustration, we simulate 100 datasets from the model fll7p . with n = 5000 and 
T = 50, and compare the performance of our empirical Bayes classifier f|T9l) to the Bayes 
oracle. We consider both a dense case, 9 = 0.75, and a sparse case, 9 = 0.95. In both 
cases the true mixing distribution f{u) = gi{(f))g2{cr'^) is a product of shifted and scaled 
beta densities. Predictive recursion marginal likelihood is applied, averaging over 25 
random permutations, to estimate 9 and a summary of the estimates for both the dense 
and sparse cases can be found in Table [2j We find that the estimates of 9 are unbiased 
with relatively small standard errors. Histograms, not displayed, show roughly symmetric 
distributions centered at the true 9. 

For the testing/ classification problem, we consider two measures of performance: false 
discovery rate and misclassification probability. Empirical estimates of these two quan- 
tities appear in Table [2] for both the plug-in and Bayes oracle rules, and both dense 
and sparse cases. The difference between false discovery rates is negligible in both the 
dense and sparse cases. Furthermore, the misclassification probability for the predictive 
recursion-based empirical Bayes classifier is just slightly higher than that of the Bayes 
oracle, suggesting that the latter mimics the Bayes oracle classifier very well. 



5 Discussion 



The focus in this paper is on frequentist semiparametric inference in mixture models 
with a predictive recursion-based approximation to a Dirichlet process mixture marginal 
likelihood. But a Bayesian version can be developed without much additional effort. In 
particular, information gathered from maximizing L^{9) in ([6]) can be used to construct 
efficient importance samplers to carry out posterior computation, i.e., the maximizer and 
Hessian matrix of L^{9) c an be used to construct Gaussian or heavy-tailed Student-t 
importance densities for 9 (lGewekelll989l ). 
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The choice of weights {wn : n > 1} for predictive recursion remains an important 
open problem. Convergence theory gives only minimal guidelines but, in our experience, 
the finite sample performance is relatively robust to the choice of weights. In this paper, 
we have employed the theoretically ideal weights Wi = {i + l)""^, with 7 = 2/3, based on 
the rate in Theorem [TJ An alternative is to let the exponent 7 be an additional tuniri g 
parameter to maximize the approximate marginal likelih ood over flTao et al.lll999h . 
It is not yet clear, however, if the convergence theorems of Martin and Tokdarl ( l2009l ) can 
cover data- dependent weight sequences. In the random-intercept regression problems of 
Section 14. 2[ the approach with estimated 7 gave similar results, not reported, to the that 
with fixed 7 = 2/3. 

The use of Hessian-based approximations of the sampling distribution of the predic- 
tive recursion marginal likelihood estimate 6 is based on a theory of asymptotic normality 
which is not yet available. An alternative to the Hessian-based approach is a basic boot- 
strap. Empirical results, not presented here, indicate that bootstrap-based confidence 
intervals have good coverage properties, but progress on the validity of the bootstra p 
for semiparametric problems has only just recently been made (jCheng and Huangil2010l ). 
Since \ogL"{9) is not an empirical processes, these first results do not directly apply in 
our context. It is our experience that both of these approaches for approximating the 
sampling distribution of On are successful, but we leave theoretical verification of their 
validity to future research. 
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Appendix 1: Proof of Theorem [2 

Fix 6 and define the sequence of random variables Zi = Zi{9) as 

m{Yi) N ■ ^ 1 

Zi = log ——- K{m,mi-i^g), i>l 

and note that E(Zj | = 0, where ^i-i = cr(l^, . . . , Yi_i). Therefore, {{Zi, s^^) : i > 

1} forms a zero mean martingale sequence. Next, let m^g be the mixture density closest 
to m in the KuUback-Leibler sense. Then we can write 

m{y) ^ 2 



E(Zf I < / (log ""^^ I m{y)dy 



m~i,e{y) 

log — - — + log — TT r ^(y) '^y 



2Ti + 2T2. 
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The second term, T2, is bounded by a constant B according to Assumption [61 For the 
first term, let = {y : mf£{y) < mj_i ^(y)}. By properties of the logarithm we get 



Ti 



log 



< 



log 



rn.f,e{y) J 



m{y) dy 
m{y) dy 



m{y) dy 



■^0 



log 



m{y) dy 



mi_ifi{y) / 



< 2 



rrifAy) ' ^m^i,0{y)j J 



< 2 + 2 sup 



Ui,U2 



p{y\ui,o) 

p{y\u2,6) 



m{y) dy, 



(20) 



where (120|) follows by two applications of Jensen's inequality, one to the mapping x H- 1/x 
and one to the mapping x 1— ?■ x^. The last term is bounded according to Assumption [51 
Therefore, ^{Zf \ £/i_i) is uniformly bounded by a constant M and, consequently, Vn = 

Set 6„ = n/cn, where Cn = cin or Cn = 1, depending on whether or not the conditions 
of the second part of the theorem hold. It is clear that 6„, grows faster than ra^/^, but no 
faster than n, which implies 



1/2 



(Mnloglog&„)V^ ^ ^ 



6„(loglog6„) V2 

Furthermore, by Markov's inequality, we have, with probability 1 



n=l 



> 



log log b„ 



,.sr^ (log log 
< M2_^ ^ < 00. 

n=l 



62 



(21) 



(22) 



In light of ([2ID and ([22D, it now follows from Corollary 2 of lTeichei] ([19981) that ^"^^ ^ 
almost surely. Therefore, we can conclude that, with probability 1, 

1 " I 

Kn{9)--J2K{m,mi.i,e)\ 
n . I 

1=1 

n 



0. 



(23) 



i=l 



If Cn = 1, then the result follows from Theorem [H and Cesaro's theorem. If c„ = a„, write 
Ki = K{'m,mi-i^g) — K*{6) and note that Hi > 0. Summation-by-parts and monotonicity 
of the weights Wi yield the following inequality: 



n-l 



1=1 1=1 1=1 j=l i=l i=l j=l 



(24) 



Let Si = 'Yl^j=iWjKj for i > 1. Lemma 4.4 of [Martin and Tokdad (l2009l ) shows that 
the limit 6*00 is finite almost surely. Dividing through ([2^ by n and applying Cesaro's 
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theorem shows that the right-hand side is positive and bounded by 5*00 almost surely for 
large n. This observation, together with fj23|) . implies an{Kn{0) — K*{9)} is almost surely 
bounded, proving the theorem. 



Appendix 2: Predictive recursion gradient algorithm 

Here we present a version of the predictive recursion that gives the gradient of ln{d) = 
logL^'(^) as a by-product. Let Xi{e) = mi_i^e{Yi); then Vin{0) = Er=i ^ log Ai(^). For a 
function g{6, u), the notation Vg{6, u) means the gradient of g with respect to 6', pointwise 
in u. 

As in the original algorithm, the user must be able to evaluate the kernel p{Yi \ 6, u) 
at each Yi for any pair {6,u). Furthermore, for this modification, the user must also be 
able to evaluate Vp{Yi \ 6,u). 

1. Start with the user-defined foM^u) and compute V fo^^u). 

2. For i = 1, . . . , n, repeat the following three steps: 

(a) Set g{e,u) = piYi \ e,u), Vg{e,u) = Vp{Yi \ e,u), and 

Gie, u) = g{e, u)Vf,^iM + V^?(^, u)f-_,^eiu). 

(b) Compute Ai(e) = / ^(^, ci/i(n) and V log Ai(^) = J G{e,u) dn{u)/Xi{e). 

(c) Update 

Ji,e[u) = (1 - Wi)ti-i,e[u) + Wi , 

^fifiiu) = (1 - Wi)W fi^i^eiu) + Wi| 1. 

3. Return fn,e{u), mn,e{y), and ZlLi ^ log -^il^)- 
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